Refactor/preprocessing/text normalization #21

Merged
ZenithClown merged 7 commits into master from refactor/preprocessing/text-normalization
Oct 23, 2025

Conversation

@ZenithClown
Member

📜 Description

This PR brings the following change(s):

  • 🐛 Bug Fix - a non-breaking change that fixes an issue.
  • 🌟 New Feature - a non-breaking change that adds new functionality.
  • 🛠️ Breaking Change - a fix or feature that may cause existing functionality to stop working as expected.
  • 📖 Documentation Change - changes or updates the documentation of a particular project, function, etc.

Fixes # (issue number)

✔️ Checks

  • Tests are added and passed for all relevant functions (if any).
  • My code follows the style guidelines as mentioned.
  • In case of new functionality, proper comments have been added to the code.
  • Changes do not generate any warnings for existing functions.
  • A new entry is added to the documentation when fixing a bug or adding a new feature.

- grouping similar functions into a submodule is the logical approach
- text normalization is part of pre-processing raw text
- the new version takes a modular approach, using an .apply() method to clean white space from text
- all keyword arguments are processed internally to build the model and run the underlying function
- 📃 updated documentation of the model and function
- 💣 removed the deprecated function strip_whitespace() from the method
- 💣 updated init-time optimization in the module
- 🚧 documentation and field validation are pending
- added an option to remove extra words, fixes #13
- word tokenization and stop-word removal are now in one modular method
- 💣 this deprecates the internal nlpurify/feature/selection/nltk.py methods
- added an attribute to control the case folding (upper/lower) used when checking stop words, to match the final string's case-folding requirements
- added an example in Jupyter notebooks
- added preprocessing utility methods
- modularized word tokenization in stop-word selection
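
The notes above describe a model that is built from keyword arguments and exposes an .apply() method, plus a single method that tokenizes text and removes stop words with optional extra words and case folding. The following is a minimal sketch of that shape, not the project's actual code. The names WhiteSpace, StopWords, normalize, and all field names are hypothetical; plain dataclasses stand in for whatever model class the library uses, and the tokenizer is a naive regex rather than NLTK.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class WhiteSpace:
    """Hypothetical model: cleans white space from a string."""

    strip: bool = True        # trim leading/trailing white space
    multispace: bool = True   # collapse internal runs into one space

    def apply(self, text: str) -> str:
        if self.multispace:
            text = re.sub(r"\s+", " ", text)
        return text.strip() if self.strip else text


def normalize(text: str, **kwargs) -> str:
    """Build the model from keyword arguments, then run it (assumed flow)."""
    return WhiteSpace(**kwargs).apply(text)


@dataclass(frozen=True)
class StopWords:
    """Hypothetical model: tokenizes and removes stop words in one step."""

    # Tiny illustrative stop-word list; a real one would be much larger.
    words: frozenset = frozenset({"a", "an", "the", "is", "of"})
    extra: frozenset = frozenset()   # caller-supplied extra words to drop
    casefold: str = "lower"          # "lower" or "upper"

    def apply(self, text: str) -> str:
        fold = str.upper if self.casefold == "upper" else str.lower
        # Fold the stop words the same way as the output, so the
        # membership check matches the final string's case.
        blocked = {fold(w) for w in self.words | self.extra}
        tokens = re.findall(r"\w+", text)   # naive word tokenization
        return " ".join(fold(t) for t in tokens if fold(t) not in blocked)
```

Under these assumptions, normalize("  hello \t\n world  ") returns "hello world", and StopWords(extra=frozenset({"quick"}), casefold="upper").apply("The quick fox is a fox") returns "FOX FOX". Folding the stop-word set with the same function as the tokens is what keeps the check consistent for either case setting.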
@ZenithClown ZenithClown merged commit 7df63f0 into master Oct 23, 2025
5 of 6 checks passed
@ZenithClown ZenithClown deleted the refactor/preprocessing/text-normalization branch October 23, 2025 10:06